Abstract

This paper presents a novel approach to speaker subspace modelling based on Gaussian-Binary Restricted Boltzmann Machines (GRBM). The proposed model is based on the idea of shared factors, as in Probabilistic Linear Discriminant Analysis (PLDA). The GRBM hidden layer is divided into speaker and channel factors, where the speaker factor is shared over all vectors of the speaker. Maximum Likelihood Parameter Estimation (MLE) for the proposed model is then introduced. Several new scoring techniques for speaker verification using GRBM are proposed. Results on the NIST i-vector Challenge 2014 dataset are presented.

Index Terms: speaker recognition, speaker verification, Restricted Boltzmann Machines, i-vector, PLDA

1. Introduction 

Current approaches to text-independent automatic speaker verification (ASV) generally focus on modelling speaker and channel variability. Most of these methods are based on factorising the long-term distribution of spectral features. The standard method in ASV is to model this distribution using a Gaussian Mixture Model (GMM) that is trained on a large audio database and referred to as the Universal Background Model (UBM). The Joint Factor Analysis technique [1] is based on the decomposition of a UBM supervector¹ into additive components belonging to the speaker and channel subspaces. The speaker and channel subspaces are modelled using low-dimensional factors. The i-vector approach is based on the total variability model [2], which represents the supervector in a low-dimensional space containing both speaker and channel information. Probabilistic Linear Discriminant Analysis (PLDA) [3] is applied to handle the influence of channel variability in the i-vector space. PLDA decomposes i-vectors into speaker and channel factors, where the speaker factor is the same for all i-vectors of the speaker [4].

In this paper we examine an alternative way to model the speaker subspace effectively using Restricted Boltzmann Machines (RBM). The idea is close to PLDA factor modelling and is based on dividing the RBM hidden layer into speaker and channel factors, where the speaker factor is shared over all vectors of the speaker. The proposed model uses a Gaussian-Binary RBM (GRBM), in contrast to the model described in [5], where a Gaussian-Gaussian RBM was considered. The proposed approach extends straightforwardly to the case of a Binary-Binary RBM (BRBM). This choice is motivated by the ability to use Gaussian-Binary and Binary-Binary blocks as internal parts of deeper architectures such as Deep Belief Networks and Deep Boltzmann Machines.

¹ A supervector is a vector of stacked GMM mean vectors.

The paper is organized as follows. Section 2 covers the basic definitions of GRBMs, introduces the GRBM with shared latent subspace and the corresponding generative model, and derives MLE for the proposed model, including a modification of the contrastive divergence algorithm. Section 2.4 describes several new scoring techniques for ASV, including log-likelihood ratio (LLR) and normalized cosine scoring. Section 3 describes the train and test datasets, then reports results on the NIST i-vector Challenge 2014 dataset and compares them to the baseline and state-of-the-art methods. Section 4 discusses conclusions and directions for future work. The appendix (section 5) presents some theoretical proofs.

2. Shared latent subspace modelling within Gaussian-Binary Restricted Boltzmann Machines

2.1. General GRBM 

A GRBM defines a probability density function (PDF) over a visible input variable $x$ and a hidden (latent) variable $h$

$$P(x, h) = \frac{1}{Z} e^{-E(x, h)}$$

where $Z$ is a normalizing constant called the partition function and $E(x, h)$ is an energy function. For a GRBM, $x$ is a continuous real-valued vector and $h$ is a binary vector whose components take values in $\{0, 1\}$. $E(x, h)$ depends on the visible bias $b$, the hidden bias $d$, the vector of standard deviations $\sigma$ and the connectivity matrix $W$

$$E(x, h) = \frac{1}{2} \left\| \frac{x - b}{\sigma} \right\|^{2} - d^{T} h - \left( \frac{x}{\sigma^{2}} \right)^{T} W h$$

Here and below, $*/*$ denotes element-wise division of vectors, $*^{2}$ denotes element-wise squaring, $T$ denotes transposition and $\| * \|$ is the Euclidean norm.
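The energy function above can be evaluated directly. The following is a minimal NumPy sketch, not the authors' implementation; the function name `grbm_energy` and the toy sizes (4 visible units, 3 hidden units) are assumptions chosen only for illustration.

```python
import numpy as np

def grbm_energy(x, h, b, d, sigma, W):
    """Energy E(x, h) of a Gaussian-binary RBM for one visible/hidden pair."""
    quad = 0.5 * np.sum(((x - b) / sigma) ** 2)   # 0.5 * ||(x - b) / sigma||^2
    hidden = d @ h                                # d^T h
    inter = (x / sigma ** 2) @ (W @ h)            # (x / sigma^2)^T W h
    return quad - hidden - inter

# Toy example (hypothetical sizes and values).
rng = np.random.default_rng(0)
x = rng.normal(size=4)
h = np.array([1.0, 0.0, 1.0])
b = np.zeros(4)
d = np.zeros(3)
sigma = np.ones(4)
W = 0.01 * rng.normal(size=(4, 3))
energy = grbm_energy(x, h, b, d, sigma, W)
```

With all hidden units switched off, the energy reduces to the quadratic term alone, which is a quick sanity check on the signs of the other two terms.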

2.2. GRBM with shared latent subspace 

The GRBM is modified to model speaker and channel variability. The hidden variable is divided into the speaker factor $s$ and the channel factor $c$, i.e. $h = [s; c]$. Accordingly, the parameters are split into two groups, i.e. $d = [f; g]$, $W = [F, G]$. Rewriting the energy function in terms of the split parameters gives

$$E(x, s, c) = \frac{1}{2} \left\| \frac{x - b}{\sigma} \right\|^{2} - f^{T} s - g^{T} c - \left( \frac{x}{\sigma^{2}} \right)^{T} (F s + G c)$$

The speaker factor is assumed to be the same for all i-vectors of one speaker, while the channel factors are individual for each i-vector. Below, a set of generative models, indexed by the number of i-vectors belonging to the speaker, and their PDFs are introduced. Consider the case of $N$ i-vectors of a speaker and the corresponding $N$-order generative model. Denote the speaker data as $X = \{x_1, x_2, \ldots, x_N\}$, the channel factors as $C = \{c_1, c_2, \ldots, c_N\}$ and the speaker factor as $s$. The PDF of the $N$-order model is expressed as follows

$$P_N(X, s, C) = \frac{1}{Z_N} e^{-E_N(X, s, C)} \qquad (1)$$

where $E_N(X, s, C) = \sum_{n=1}^{N} E(x_n, s, c_n)$ and $Z_N = \int_X \sum_{s, C} e^{-E_N(X, s, C)} \, dX$. The generative process for this model is shown in Figure 1. First $s$, $C$ are generated according to the distribution $P_N(s, C) = \int_X P_N(X, s, C) \, dX$, then $X$ is generated according to the distribution $P_N(X | s, C)$, where

$$P_N(X | s, C) = \prod_{n=1}^{N} \mathcal{N}(x_n | b + F s + G c_n, \sigma^{2}) \qquad (2)$$

and $\mathcal{N}$ denotes the Gaussian distribution.

Figure 1: The generative process for the $N$-order model
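The second stage of the generative process, drawing $X$ from (2) once $s$ and $C$ are fixed, can be sketched as follows. This is an illustrative sketch only: the helper name `sample_visible`, the array layout (one row per i-vector) and the toy dimensions are assumptions, and sampling $s$, $C$ from $P_N(s, C)$ is not shown.

```python
import numpy as np

def sample_visible(s, C, b, F, G, sigma, rng):
    """Draw X ~ P_N(X | s, C): row n is N(b + F s + G c_n, sigma^2)."""
    # The speaker term F s is shared by all rows; G c_n differs per i-vector.
    mean = b + F @ s + C @ G.T                     # shape (N, D)
    return mean + sigma * rng.standard_normal(mean.shape)

# Toy example (hypothetical sizes): D visible dims, Js speaker and Jc channel units.
rng = np.random.default_rng(0)
D, Js, Jc, N = 6, 3, 2, 4
b = np.zeros(D)
sigma = np.ones(D)
F = 0.01 * rng.normal(size=(D, Js))
G = 0.01 * rng.normal(size=(D, Jc))
s = np.array([1.0, 0.0, 1.0])                      # one speaker factor for all N vectors
C = rng.integers(0, 2, size=(N, Jc)).astype(float) # one channel factor per vector
X = sample_visible(s, C, b, F, G, sigma, rng)
```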


2.3. Maximum likelihood parameter estimation 

Assume we have a labeled training set of $K$ speakers, denoted by $\mathcal{X} = \{X_k\}_{k=1}^{K}$, where $X_k$ is the data of the $k$-th speaker, consisting of $N_k$ i-vectors. Let $N_k \in \{2, 3, \ldots, M\}$; hence there are $M - 1$ generative models, and their parameters are assumed to be tied. The aim is to estimate the set of parameters $\theta = \{f, g, F, G, b, \sigma\}$ using the MLE criterion, which is the standard approach for RBMs. To optimize the MLE objective function we use stochastic gradient descent, which is widely used for RBMs. Since the data of different speakers are assumed to be independent, the normalized log-likelihood function takes the form of a sum of the log-likelihood functions of each generative model

$$\mathcal{L}_{norm}(\mathcal{X} | \theta) = \frac{1}{\sum_k N_k} \sum_{k} \mathcal{L}(X_k | \theta) = \frac{1}{\sum_k N_k} \sum_{N=2}^{M} \sum_{k : N_k = N} \mathcal{L}(X_k | \theta)$$


From here on the speaker index is omitted and the likelihood of the data from one speaker is discussed. Denote the speaker data as $X = \{x_1, x_2, \ldots, x_N\}$; then

$$\mathcal{L}(X | \theta) = \log P_N(X) \qquad (3)$$

Denote the realization of the speaker factor as $s$ and the realizations of the channel factors as $c_n$, $C = \{c_1, c_2, \ldots, c_N\}$. Consider the log-likelihood (3), marginalizing (1) over all possible values of the latent variables

$$\mathcal{L}(X | \theta) = \log \sum_{s, C} e^{-E_N(X, s, C)} - \log Z_N \qquad (4)$$


Consider the first term of (4), denoted $\mathcal{L}_1(X | \theta)$. Applying the same transformations as for the general GRBM, its gradient is

$$\nabla_\theta \mathcal{L}_1(X | \theta) = - \frac{\sum_{s, C} P_N(X, s, C) \, \nabla_\theta E_N(X, s, C)}{P_N(X)}$$

As a result, the gradient of (4) is represented as the following difference

$$\nabla_\theta \mathcal{L}(X | \theta) = \nabla_\theta \mathcal{L}_1(X | \theta) - \mathbb{E}_{P_N(X)} \left[ \nabla_\theta \mathcal{L}_1(X | \theta) \right] \qquad (5)$$

Here $\mathbb{E}$ denotes expectation: $\mathbb{E}_{P_N(X)}[*] = \int_X P_N(X) * \, dX$. The modification of the contrastive divergence algorithm that makes it possible to compute the second term of the gradient (5) is presented in section 2.3.1. The first term is considered below. Taking into account the derivatives of the energy function, the gradient of $\mathcal{L}_1(X | \theta)$ takes the following form


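$$\nabla_{F_{ij}} \mathcal{L}_1(X | \theta) = P_N(s_j = 1 | X) \frac{\bar{x}_i}{\sigma_i^{2}} \qquad (6)$$

$$\nabla_{f_j} \mathcal{L}_1(X | \theta) = N P_N(s_j = 1 | X) \qquad (7)$$

$$\nabla_{G_{ij}} \mathcal{L}_1(X | \theta) = \sum_{n} P_N(c_{nj} = 1 | X) \frac{x_{ni}}{\sigma_i^{2}} \qquad (8)$$

$$\nabla_{g_j} \mathcal{L}_1(X | \theta) = \sum_{n} P_N(c_{nj} = 1 | X) \qquad (9)$$

$$\nabla_{b_i} \mathcal{L}_1(X | \theta) = \frac{1}{\sigma_i^{2}} (\bar{x}_i - N b_i) \qquad (10)$$

$$\nabla_{z_i} \mathcal{L}_1(X | \theta) = - \frac{\bar{x}_i}{\sigma_i^{2}} \sum_{j} F_{ij} P_N(s_j = 1 | X) - \sum_{n, j} \frac{x_{ni}}{\sigma_i^{2}} G_{ij} P_N(c_{nj} = 1 | X) + \frac{1}{2} \sum_{n} \frac{(x_{ni} - b_i)^{2}}{\sigma_i^{2}} \qquad (11)$$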

Here and below, $\bar{x} = \sum_{n=1}^{N} x_n$ and $i$, $j$ index dimensions. Additionally, instead of $\sigma_i$ we update the log-variances $z_i = \log \sigma_i^{2}$, which naturally constrains the variances to stay positive [9]. The posterior probabilities of the latent factors in expressions (6)-(11) are determined by the following relations, which are proved in the appendix

$$P_N(s_j = 1 | X) = \mathrm{sigm} \left( N f_j + (\bar{x} / \sigma^{2})^{T} F_{*j} \right) \qquad (12)$$

$$P_N(c_{nj} = 1 | X) = \mathrm{sigm} \left( g_j + (x_n / \sigma^{2})^{T} G_{*j} \right) \qquad (13)$$

Here and below, $F_{*j}$ and $G_{*j}$ denote the $j$-th columns of the corresponding matrices and $\mathrm{sigm}(*) = 1 / (1 + e^{-*})$. From expression (13) it is clear that the posterior for $c_n$ depends only on $x_n$ and is the same as for the general GRBM. The main difference is that all the speaker's i-vectors $X$ influence the speaker factor posterior (12).
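The posteriors (12) and (13) and the data-dependent gradient term (6) are simple to vectorize. The sketch below assumes the speaker's i-vectors are stored as rows of an array; the function names and toy sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def speaker_posterior(X, f, F, sigma):
    """Eq. (12): P_N(s_j = 1 | X), driven by the sum of all the speaker's i-vectors."""
    N = X.shape[0]
    x_sum = X.sum(axis=0)                          # x-bar in the text (a sum, not a mean)
    return sigm(N * f + (x_sum / sigma ** 2) @ F)

def channel_posterior(X, g, G, sigma):
    """Eq. (13): P_N(c_nj = 1 | X); row n depends only on x_n."""
    return sigm(g + (X / sigma ** 2) @ G)

def grad_F_data_term(X, f, F, sigma):
    """Eq. (6): entry (i, j) is P_N(s_j = 1 | X) * x-bar_i / sigma_i^2."""
    p_s = speaker_posterior(X, f, F, sigma)
    return np.outer(X.sum(axis=0) / sigma ** 2, p_s)

# Toy example (hypothetical sizes).
rng = np.random.default_rng(0)
D, Js, Jc, N = 6, 3, 2, 5
X = rng.normal(size=(N, D))
F = 0.01 * rng.normal(size=(D, Js))
G = 0.01 * rng.normal(size=(D, Jc))
f, g, sigma = np.zeros(Js), np.zeros(Jc), np.ones(D)
p_s = speaker_posterior(X, f, F, sigma)            # shape (Js,)
p_c = channel_posterior(X, g, G, sigma)            # shape (N, Jc)
```

For a speaker with a single i-vector, the speaker posterior has the same form as the channel posterior, which matches the remark that (13) coincides with the general GRBM.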

2.3.1. Contrastive divergence 

The modification of the contrastive divergence algorithm is presented below. It makes it possible to compute the second term of the gradient (5) approximately. The expectation is replaced by the mean over a finite set of samples from the distribution $P_N(X)$. Since these samples are hard to obtain because of the complexity of the generative process, an approximate algorithm called $m$-step contrastive divergence is applied. The scheme of the algorithm is presented in Figure 2. The speaker data $X$ is used to initialize the algorithm at step zero. The intermediate $k$-th step of the algorithm is as follows. The reconstruction of the visible data $X^k$ is sampled using (2). The latent variables $s^k$, $C^k$ are sampled using (12) and (13). For binarization we use uniformly distributed random thresholds, following the recommendations in [7].

Figure 2: Contrastive divergence
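One way to read the procedure above is sketched below for the $F$ gradient only. It is an assumption-laden illustration rather than the authors' code: the function names, the choice to estimate the second term of (5) from a single reconstructed sample, and the toy sizes are all hypothetical.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def posteriors(X, f, g, F, G, sigma):
    N = X.shape[0]
    p_s = sigm(N * f + (X.sum(axis=0) / sigma ** 2) @ F)   # eq. (12)
    p_c = sigm(g + (X / sigma ** 2) @ G)                   # eq. (13)
    return p_s, p_c

def cd_grad_F(X, f, g, b, F, G, sigma, rng, m=1):
    """m-step contrastive divergence estimate of the gradient (5) w.r.t. F."""
    p_s0, _ = posteriors(X, f, g, F, G, sigma)
    pos = np.outer(X.sum(axis=0) / sigma ** 2, p_s0)       # eq. (6) on the data
    Xk = X                                                  # step zero: the speaker data
    for _ in range(m):
        p_s, p_c = posteriors(Xk, f, g, F, G, sigma)
        # Binarize with uniformly distributed random thresholds.
        s = (p_s > rng.uniform(size=p_s.shape)).astype(float)
        C = (p_c > rng.uniform(size=p_c.shape)).astype(float)
        mean = b + F @ s + C @ G.T                          # eq. (2), shared s for all rows
        Xk = mean + sigma * rng.standard_normal(mean.shape)
    p_sk, _ = posteriors(Xk, f, g, F, G, sigma)
    neg = np.outer(Xk.sum(axis=0) / sigma ** 2, p_sk)      # eq. (6) on the reconstruction
    return pos - neg

# Toy example (hypothetical sizes).
rng = np.random.default_rng(0)
D, Js, Jc, N = 6, 3, 2, 4
X = rng.normal(size=(N, D))
F = 0.01 * rng.normal(size=(D, Js))
G = 0.01 * rng.normal(size=(D, Jc))
f, g, b, sigma = np.zeros(Js), np.zeros(Jc), np.zeros(D), np.ones(D)
grad_F = cd_grad_F(X, f, g, b, F, G, sigma, rng, m=1)
```

The only change relative to contrastive divergence for a general GRBM is that a single sampled speaker factor is reused for every reconstructed i-vector of the speaker.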

2.4. Scoring 

In this section various scoring strategies for GRBM with shared 
latent subspace will be presented. 


2.4.1. Log-likelihood scoring 

The LLR for a given verification trial $\{X, x_t\}$, where $X$ is a set of $N$ enrollment vectors of the speaker and $x_t$ is a test vector, is the log-likelihood ratio between the target and non-target hypotheses. The target hypothesis is that the trial vectors share a common speaker factor, i.e. they are generated by the $(N+1)$-order model. The non-target hypothesis is that $X$ is generated by the $N$-order model and $x_t$ is independent of $X$ and generated by the 1-order model.

$$l = \log \frac{P_{N+1}(X, x_t)}{P_N(X) P_1(x_t)}$$

The expression for the LLR score is given below; its proof is given in the appendix

$$l = \sum_{i} \log \frac{1 + e^{(N+1) f_i + ((\bar{x} + x_t) / \sigma^{2})^{T} F_{*i}}}{\left( 1 + e^{N f_i + (\bar{x} / \sigma^{2})^{T} F_{*i}} \right) \left( 1 + e^{f_i + (x_t / \sigma^{2})^{T} F_{*i}} \right)} + \log \frac{Z_N Z_1}{Z_{N+1}}$$

Methods exist for the approximate computation of the partition function. Note that the values of the partition function do not influence the performance of the system when all speakers have the same number of enrollment vectors.
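Because the partition-function term is constant when every speaker has the same number of enrollment vectors, the data-dependent part of the LLR can be computed on its own. The sketch below does this with a numerically stable softplus; the function name `llr_score` and the optional `log_z_term` argument are assumptions introduced for illustration.

```python
import numpy as np

def llr_score(X, x_t, f, F, sigma, log_z_term=0.0):
    """LLR of the trial {X, x_t}; log_z_term stands for log(Z_N * Z_1 / Z_{N+1})."""
    N = X.shape[0]
    a_enroll = N * f + (X.sum(axis=0) / sigma ** 2) @ F    # exponent of the N-order term
    a_test = f + (x_t / sigma ** 2) @ F                    # exponent of the 1-order term
    a_joint = a_enroll + a_test                            # exponent of the (N+1)-order term
    softplus = lambda a: np.logaddexp(0.0, a)              # log(1 + e^a), overflow-safe
    return np.sum(softplus(a_joint) - softplus(a_enroll) - softplus(a_test)) + log_z_term

# Toy example (hypothetical values): a test vector aligned with the enrollment
# data should score higher than one pointing the opposite way.
F = np.eye(2)
f = np.zeros(2)
sigma = np.ones(2)
X = np.array([[5.0, 5.0]])
score_same = llr_score(X, np.array([5.0, 5.0]), f, F, sigma)
score_opposite = llr_score(X, np.array([-5.0, -5.0]), f, F, sigma)
```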


2.4.2. Cosine scoring 

We apply standard cosine scoring [12] to i-vectors previously projected onto the subspace $F^{T}$. Denote $y_n = F^{T} x_n / \| F^{T} x_n \|$ for each i-vector of the speaker from $X$ and $y_t = F^{T} x_t / \| F^{T} x_t \|$ for the test i-vector. The score is the cosine between the average speaker vector $y_{sp} = \sum_n y_n / N$ and the test vector

$$l = \cos(y_t, y_{sp}) = \frac{y_t^{T} y_{sp}}{\| y_{sp} \|}$$

In addition to the general cosine score, we propose the normalized cosine score $l_{norm}$. It takes into account information on the width of the speaker's cluster, which is lost in standard cosine scoring. The general cosine score is divided by the average cosine within the speaker's set, $\cos_{sp} = \sum_n \frac{y_n^{T} y_{sp}}{\| y_{sp} \|} / N$. It can be shown that $\cos_{sp} = \| y_{sp} \|$. Taking this into account, the expression for the normalized cosine score takes the form

$$l_{norm} = \frac{l}{\cos_{sp}} = \frac{y_t^{T} y_{sp}}{\| y_{sp} \|^{2}}$$
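Both cosine scores can be computed in a few lines. The sketch below assumes that the projection is $F^{T} x$ followed by length normalization; the function names and toy data are hypothetical.

```python
import numpy as np

def project_unit(F, x):
    """Project an i-vector with F^T and scale the result to unit length."""
    y = F.T @ x
    return y / np.linalg.norm(y)

def cosine_scores(F, X, x_t):
    """Return (cosine score, normalized cosine score) for the trial {X, x_t}."""
    Y = np.stack([project_unit(F, x) for x in X])   # one unit vector per enrollment i-vector
    y_sp = Y.mean(axis=0)                           # average speaker vector
    y_t = project_unit(F, x_t)
    norm_sp = np.linalg.norm(y_sp)                  # equals the average within-speaker cosine
    l = y_t @ y_sp / norm_sp
    return l, l / norm_sp

# Toy example (hypothetical sizes).
rng = np.random.default_rng(0)
F = rng.normal(size=(6, 3))
X = rng.normal(size=(4, 6))
x_t = rng.normal(size=6)
l, l_norm = cosine_scores(F, X, x_t)
```

Since $\| y_{sp} \| \le 1$, with equality only when all enrollment vectors coincide, the normalization scales the score up more for speakers whose enrollment vectors are widely spread.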


2.4.3. PLDA on F-projected i-vectors 

A PLDA model is trained on i-vectors projected onto the subspace $F^{T}$ and then onto the unit sphere, i.e. on the vectors $y_n$. PLDA handles the residual channel variability using a linear factor model [3]. Scoring is done using the LLR for the PLDA model [13, 14].

3. Experimental results 

3.1. Dataset 

The NIST i-vector Machine Learning Challenge 2014 dataset was chosen to test the efficiency of the proposed model. The dataset consists of a labeled development set (devset), a labeled model set (modelset) with 5 i-vectors per model and an unlabeled test set (testset). Since labels for the devset were not available during the challenge, the best results were obtained by methods that clustered the devset and then applied PLDA [15, 16].

In our experiments we rearranged the dataset. First, all i-vectors with duration less than 10 seconds were removed because of their poor quality. We then constructed new labeled sets: trainset, modelset, testset, modelsetCV and testsetCV. Speakers from the devset with 3 to 10 i-vectors, together with the initial modelset, are assigned to the trainset; speakers with 11 to 15 i-vectors are assigned to the new modelset and testset; the remaining speakers, with more than 15 i-vectors, form the cross-validation set (modelsetCV, testsetCV). The first 5 i-vectors of each speaker form the enrollment in the modelset and the remaining ones form the testset. The same is done for the cross-validation set. As a result, the trainset contains 3281 speakers with 18759 i-vectors in total, while the modelset and the testset contain 717 speakers with 3585 and 5400 i-vectors respectively. We used minDCF as the measure of system performance and as the cross-validation criterion

$$minDCF = \min_{th} \left[ FR(th) + 100 \, FA(th) \right]$$

where $FA$ and $FR$ denote the false acceptance and false rejection rates, and $th$ is the varying threshold. The trials consist of all possible pairs of a target speaker set from the modelset and a test i-vector from the testset.
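The minDCF measure defined above can be computed by sweeping the threshold over the observed scores. The sketch below is illustrative only: the function name `min_dcf`, the convention that a trial is accepted when its score is at least the threshold, and the toy scores are assumptions, and it is not the official challenge scoring tool.

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores, fa_weight=100.0):
    """min over th of FR(th) + fa_weight * FA(th); a trial is accepted if score >= th."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.append(np.unique(np.concatenate([target_scores, nontarget_scores])), np.inf)
    best = np.inf
    for th in thresholds:
        fr = np.mean(target_scores < th)        # false rejection rate
        fa = np.mean(nontarget_scores >= th)    # false acceptance rate
        best = min(best, fr + fa_weight * fa)
    return best

# Toy example (hypothetical scores).
dcf_separable = min_dcf([2.0, 3.0], [0.0, 1.0])
dcf_overlap = min_dcf([0.0, 2.0], [1.0])
```

With the weight of 100 on false acceptances, the optimal threshold tends to sit in the low-FA region, which is why minDCF and EER can rank systems differently.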

3.2. Parameters estimation 

The whitened trainset is used for parameter estimation. The whitening parameters are also computed on the trainset, and this transform is then applied in all trials. We set the initial biases $f$, $g$ and $b$ to zero. Following the recommendations in [7], the elements of the connectivity matrices $F$ and $G$ are drawn from a normal distribution with zero mean and standard deviation 0.01. The elements of the standard deviation vector $\sigma$ were set to 1.0. Re-estimating $\sigma$ gave worse results.

The best performance was obtained with a speaker factor dimension of 500 and a channel factor dimension of 100, for an i-vector dimension of 600. We used the mini-batch stochastic gradient descent algorithm [7] with learning rate 0.01, momentum 0.5 and zero weight decay. Each batch contained 256 speakers. After each epoch the speakers are shuffled between batches. It took 40 epochs to achieve the best minDCF on the cross-validation set. When all speakers belong to one batch, it took 10 times more iterations to reach the same system performance. To train the PLDA model on i-vectors, the whitened trainset was projected onto the unit sphere. The best speaker and channel factor dimensions for PLDA were found to be 590 and 10 respectively. The PLDA model trained on i-vectors projected on $F^{T}$ has speaker and channel factor dimensions of 499 and 1 respectively. Increasing the channel factor dimension gave worse results.

3.3. Results 

We compare our algorithm with the NIST 2014 baseline cosine scoring and the state-of-the-art PLDA [15, 16]. As the results in Table 1 and Figure 3 demonstrate, all scoring strategies perform better than the challenge baseline. Despite the optimality of log-likelihood GRBM scoring, it did not show the best results among the GRBM scoring strategies. Perhaps this is due to the specifics of the i-vector data. The proposed normalized cosine scoring performs better than standard cosine scoring. In terms of EER, the best result is achieved by PLDA trained on i-vectors projected on the speaker space $F^{T}$ of the GRBM. In addition, a linear fusion [18] of two PLDA models is presented. The first model uses i-vectors as features and the second one uses i-vectors projected on $F^{T}$. The fusion coefficients were estimated on the cross-validation set by logistic regression training with a weighted MLE criterion. As can be seen in Figure 3, the fused scores outperform i-vector PLDA in the low-FA region and retain performance in the EER region.

Figure 3: Comparison of the proposed scoring algorithms with NIST 2014 baseline cosine and PLDA

Method           EER (in %)   minDCF
Baseline         2.81         0.210
LLR              1.68         0.185
Cos              1.58         0.167
Norm cos         1.43         0.145
F-proj PLDA      1.30         0.123
i-vector PLDA    1.51         0.114
Fuse             1.33         0.108

Table 1: Results on NIST 2014 dataset.

4. Conclusions and Further Work

We used a shared latent subspace in the GRBM hidden layer to separate speaker-dependent and speaker-independent factors in the i-vector space. Approximate maximum likelihood parameter estimation was presented. Several scoring methods for speaker verification were considered for the proposed model, including a novel log-likelihood scoring and normalized cosine scoring. PLDA operating on i-vectors projected on the GRBM speaker space achieved results comparable to the state-of-the-art i-vector PLDA approach. The fusion of these two PLDA models showed the best results at all operating points.

In further work, projection on the GRBM speaker space can be viewed as a stand-alone channel variability compensation technique. The GRBM with shared latent subspace can be extended to other types of RBM and used as a block in deeper architectures.

5. Appendix

In this section the LLR score expression from section 2.4.1 and expressions (12), (13) are derived. They can be obtained from an expression for the posterior probability $P_N(s, C | X)$. First we derive the joint PDF of the latent variables and data $P_N(s, C, X)$ using its definition (1)

$$P_N(s, C, X) = C_{N,X} \cdot \prod_{i} e^{N f_i s_i + (\bar{x} / \sigma^{2})^{T} F_{*i} s_i} \prod_{n, j} e^{g_j c_{nj} + (x_n / \sigma^{2})^{T} G_{*j} c_{nj}} \qquad (14)$$

Here $C_{N,X} = \frac{1}{Z_N} e^{-\frac{1}{2} \sum_n \| (x_n - b) / \sigma \|^{2}}$. Marginalizing (14) over all possible latent variables we have

$$P_N(X) = C_{N,X} \cdot \prod_{i} \left( 1 + e^{N f_i + (\bar{x} / \sigma^{2})^{T} F_{*i}} \right) \prod_{n, j} \left( 1 + e^{g_j + (x_n / \sigma^{2})^{T} G_{*j}} \right) \qquad (15)$$

Finally, the posterior probability of the latent variables is the ratio of (14) to (15)

$$P_N(s, C | X) = \prod_{i} \frac{e^{N f_i s_i + (\bar{x} / \sigma^{2})^{T} F_{*i} s_i}}{1 + e^{N f_i + (\bar{x} / \sigma^{2})^{T} F_{*i}}} \prod_{n, j} \frac{e^{g_j c_{nj} + (x_n / \sigma^{2})^{T} G_{*j} c_{nj}}}{1 + e^{g_j + (x_n / \sigma^{2})^{T} G_{*j}}} \qquad (16)$$

Now expressions (12) and (13) can be obtained by summing (16) over the corresponding latent variables. The expression for the LLR score is obtained by applying (15) to the three trial subsets.